Abstract


Assessing the probability of occurrence of extreme events is a crucial issue in various fields like 
finance, insurance, telecommunication or environmental sciences. In a multivariate framework, the 
tail dependence is characterized by the so-called stable tail dependence function (STDF). Learning 
this structure is the keystone of multivariate extremes. Although extensive studies have proved
consistency and asymptotic normality for the empirical version of the STDF, non-asymptotic bounds
are still missing. The main purpose of this paper is to fill this gap. Taking advantage of adapted
VC-type concentration inequalities, upper bounds are derived with expected rate of convergence in
$O(k^{-1/2})$. The concentration tools involved in this analysis rely on a more general study of
maximal deviations in low probability regions, and thus directly apply to the classification of
extreme data.

Keywords: VC theory, multivariate extremes, stable tail dependence function, concentration 
inequalities, extreme data classification. 

1. Introduction 

Extreme Value Theory (EVT) develops models for learning the unusual rather than the usual. These
models are widely used in fields involving risk management like finance, insurance,
telecommunication or environmental sciences. One major application of EVT is to provide a
reasonable assessment of the probability of occurrence of rare events. To illustrate this point,
suppose we want to manage the risk of a portfolio containing $d$ different assets,
$X = (X_1, \ldots, X_d)$. A fairly general purpose is then to evaluate the probability of events of
the kind $\{X_1 \ge x_1 \text{ or } \ldots \text{ or } X_d \ge x_d\}$, for large multivariate
thresholds $x = (x_1, \ldots, x_d)$. Under not too stringent conditions on the regularity of $X$'s
distribution, EVT shows that for large enough thresholds (see Section 2 for details),

$$\mathbb{P}\{X_1 \ge x_1 \text{ or } \ldots \text{ or } X_d \ge x_d\} \simeq l(p_1, \ldots, p_d),$$

where $l$ is the stable tail dependence function and the $p_j$'s are the marginal exceedance
probabilities, $p_j = \mathbb{P}(X_j \ge x_j)$. Thus, the functional $l$ characterizes the
dependence among extremes. The joint distribution (over large thresholds) can thus be recovered
from the knowledge of the marginal distributions together with the STDF $l$. In practice, $l$ can
be learned from 'moderately extreme' data, typically the $k$ 'largest' ones among a sample of size
$n$, with $k \ll n$. Recovering the $p_j$'s can be done following a well paved way: in the
univariate case, EVT essentially consists in modeling the distribution of the maxima (resp. the
upper tail) as a generalized extreme value distribution, namely an element of the Gumbel, Fréchet
or Weibull parametric families (resp. by a generalized Pareto distribution).

In contrast, in the multivariate case, there is no finite-dimensional parametrization of the
dependence structure. The latter is characterized by the so-called stable tail dependence function
(STDF). Estimating this functional is thus one of the main issues in multivariate EVT. Asymptotic
properties of the empirical STDF have been widely studied, see Huang (1992), Drees and Huang
(1998), Embrechts et al. (2000) and de Haan and Ferreira (2006) for the bivariate case, and Qi
(1997), Einmahl et al. (2012) for the general multivariate case under smoothness assumptions.

However, to the best of our knowledge, no bounds exist on the finite sample error. It is precisely 
the purpose of this paper to derive such non-asymptotic bounds. Our results do not require any 
assumption other than the existence of the STDF. The main idea is as follows. The empirical 
estimator is based on the empirical measure of ‘extreme’ regions, which are hit only with low 
probability. It is thus enough to bound maximal deviations on such low probability regions. The
key is, on the one hand, to choose an adaptive VC class which only covers the latter regions and,
on the other hand, to derive VC-type inequalities that incorporate $p$, the probability of hitting
the class at all.

The structure of the paper is as follows. The whys and wherefores of EVT and the STDF 
are explained in Section 2. In Section 3, concentration tools which rely on the general study of 
maximal deviations in low probability regions are introduced, with an immediate application to the 
framework of classification (Remark 5). The main result of the paper, a non-asymptotic bound on 
the convergence of the empirical STDF, is derived in Section 4. Section 5 concludes. 

2. Background in extreme value theory 

A useful setting to understand the use of EVT and to give intuition about the STDF concept is
that of risk monitoring. In the univariate case, it is natural to consider the $(1-p)$th quantile
of the distribution $F$ of a random variable $X$, for a given exceedance probability $p$, that is
$x_p = \inf\{x \in \mathbb{R},\ \mathbb{P}(X > x) \le p\}$. For moderate values of $p$, a natural
empirical estimate is
$x_{p,n} = \inf\{x \in \mathbb{R},\ \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i > x} \le p\}$. However,
if $p$ is very small, the finite sample $X_1, \ldots, X_n$ contains insufficient information and
$x_{p,n}$ becomes irrelevant. That is where EVT comes into play by providing parametric estimates
of large quantiles: whereas statistical inference often involves sample means and the central
limit theorem, EVT handles phenomena whose behavior is not ruled by an 'averaging effect'. The
focus is on the sample maximum rather than the mean. The primal assumption is the existence of two
sequences $\{a_n, n \ge 1\}$ and $\{b_n, n \ge 1\}$, the $a_n$'s being positive, and a
non-degenerate distribution function $G$ such that

$$\lim_{n \to \infty} n\, \mathbb{P}\left( \frac{X - b_n}{a_n} \ge x \right) = -\log G(x) \tag{1}$$

for all continuity points $x \in \mathbb{R}$ of $G$. If this assumption is fulfilled (it is the
case for most textbook distributions) then $F$ is said to be in the domain of attraction of $G$,
denoted $F \in DA(G)$. The tail behavior of $F$ is then essentially characterized by $G$, which is
proved to be, up to rescaling, of the type $G(x) = \exp(-(1+\gamma x)^{-1/\gamma})$ for
$1 + \gamma x > 0$, $\gamma \in \mathbb{R}$, setting by convention
$(1+\gamma x)^{-1/\gamma} = e^{-x}$ for $\gamma = 0$. The sign of $\gamma$ controls the shape of
the tail, and various estimators of the rescaling sequence as well as of $\gamma$ have been studied
in great detail, see e.g. Dekkers et al. (1989), Einmahl et al. (2009), Hill (1975), Smith (1987),
Beirlant et al. (1996).
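
To make the limitation of the purely empirical quantile concrete, here is a minimal sketch of the estimator $x_{p,n} = \inf\{x : \frac{1}{n}\sum_i \mathbf{1}_{X_i > x} \le p\}$. The function name `empirical_quantile` and the small numerical tolerance are illustrative choices, not part of the original text.

```python
import bisect
import math


def empirical_quantile(sample, p):
    """Return inf{x : (1/n) * #{i : X_i > x} <= p} for 0 <= p.

    The exceedance fraction is a right-continuous step function of x that
    only changes at sample points, so the infimum is an order statistic.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    if p < 0:
        raise ValueError("p must be non-negative")
    if p >= 1:
        return -math.inf  # every x satisfies the constraint
    ordered = sorted(sample)
    n = len(ordered)
    for x in ordered:
        # number of observations strictly larger than x
        exceed = n - bisect.bisect_right(ordered, x)
        if exceed <= p * n + 1e-12:  # tolerance guards against rounding
            return x
    return ordered[-1]  # not reached: the maximum has zero exceedances


if __name__ == "__main__":
    data = list(range(1, 11))
    # For any p < 1/n the estimate is stuck at the sample maximum.
    print(empirical_quantile(data, 0.2), empirical_quantile(data, 1e-6))
```

As soon as $p < 1/n$ the estimate no longer depends on $p$ and equals the sample maximum, which is the sense in which $x_{p,n}$ "becomes irrelevant" for very small $p$.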

In the multivariate case, it is mathematically very convenient to decompose the joint distribution
of $X = (X^1, \ldots, X^d)$ into the margins on the one hand, and the dependence structure on the
other hand. In particular, handling uniform margins is very helpful when it comes to establishing
upper bounds on the deviations between empirical and mean measures. Define thus standardized
variables $U^j = 1 - F_j(X^j)$, where $F_j$ is the marginal distribution function of $X^j$, and
$U = (U^1, \ldots, U^d)$. Knowledge of the $F_j$'s and of the joint distribution of $U$ allows one
to recover that of $X$, since
$\mathbb{P}(X^1 \le x_1, \ldots, X^d \le x_d) = \mathbb{P}(U^1 \ge 1 - F_1(x_1), \ldots, U^d \ge 1 - F_d(x_d))$.
With these notations, under a fairly general assumption similar to (1) (namely, standard
multivariate regular variation of standardized variables, see e.g. Resnick (2007), chap. 6), there
exists a limit measure $\Lambda$ on $[0, \infty]^d \setminus \{\infty\}$ (called the exponent
measure) such that

$$\lim_{t \to 0} t^{-1}\, \mathbb{P}\left[ U^1 \le t\, x_1 \text{ or } \ldots \text{ or } U^d \le t\, x_d \right] = \Lambda[x, \infty]^c := l(x) \qquad (x_j \in [0, \infty],\ x \ne \infty). \tag{2}$$


Notice that no assumption is made about the marginal distributions, so that our framework allows
non-standard regular variation, or even no regular variation at all of the original data $X$ (for
more details see e.g. Resnick (2007), th. 6.5 or Resnick (1987), prop. 5.10). The functional $l$
in the limit in (2) is called the stable tail dependence function. In the remainder of this paper,
the only assumption is the existence of a limit in (2), i.e., the existence of the STDF.

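We emphasize that the knowledge of both $l$ and the margins gives access to the probability of
hitting 'extreme' regions of the kind $[0, x]^c$, for 'large' thresholds $x = (x_1, \ldots, x_d)$
(i.e. such that for some $j \le d$, $1 - F_j(x_j)$ is $O(t)$ for some small $t$). Indeed, in such
a case,

$$
\begin{aligned}
\mathbb{P}(X^1 > x_1 \text{ or } \ldots \text{ or } X^d > x_d)
&= \mathbb{P}\Big( \bigcup_{j=1}^d \big\{ (1 - F_j)(X^j) \le (1 - F_j)(x_j) \big\} \Big) \\
&= t \left\{ \frac{1}{t}\, \mathbb{P}\Big( \bigcup_{j=1}^d \Big\{ U^j \le t\, \frac{(1 - F_j)(x_j)}{t} \Big\} \Big) \right\} \\
&\underset{t \to 0}{\sim} t\; l\big( t^{-1}(1 - F_1)(x_1), \ldots, t^{-1}(1 - F_d)(x_d) \big) \\
&= l\big( (1 - F_1)(x_1), \ldots, (1 - F_d)(x_d) \big),
\end{aligned}
$$

where the last equality follows from the homogeneity of $l$. This underlines the utmost importance
of estimating the STDF and, by extension, of stating non-asymptotic bounds on this convergence.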

Any stable tail dependence function $l(\cdot)$ is in fact a norm (see Falk et al. (1994), p. 179)
and satisfies

$$\max\{x_1, \ldots, x_d\} \le l(x) \le x_1 + \ldots + x_d,$$

where the lower bound is attained if $X$ is perfectly tail dependent (extremes of univariate
marginals always occur simultaneously), and the upper bound in case of tail independence or
asymptotic independence (extremes of univariate marginals never occur simultaneously). We refer to
Falk et al. (1994) for more details and properties on the STDF.
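
As a quick numerical illustration of the bounds $\max_j x_j \le l(x) \le \sum_j x_j$ and of homogeneity, the sketch below uses the symmetric logistic family $l_\theta(x) = (\sum_j x_j^{1/\theta})^{\theta}$, $\theta \in (0, 1]$. This parametric family is not discussed in the text; it is brought in here only as a convenient example, with $\theta = 1$ giving the sum (tail independence) and $\theta \to 0$ approaching the max (perfect tail dependence).

```python
def logistic_stdf(x, theta):
    """Symmetric logistic STDF (illustrative family, 0 < theta <= 1)."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    return sum(xj ** (1.0 / theta) for xj in x) ** theta


if __name__ == "__main__":
    point = (0.5, 2.0, 1.0)
    for theta in (1.0, 0.5, 0.1):
        value = logistic_stdf(point, theta)
        # value always lies between max(point) and sum(point)
        print(theta, max(point), value, sum(point))
```

Scaling the argument by any $t > 0$ scales the value by $t$, which is the homogeneity property used in the derivation above.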


3. A VC-type inequality adapted to the study of low probability regions 

Classical VC inequalities aim at bounding the deviation of empirical from theoretical quantities on
relatively simple classes of sets, called VC classes. These classes typically cover the support of
the underlying distribution. However, when dealing with rare events, it is of great interest to
have such bounds on a class of sets which only covers a small probability region and thus contains
(very) few observations. This yields sharper bounds, since only differences between very small
quantities are involved. The starting point of this analysis is the following VC-inequality.


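Theorem 1 Let $X_1, \ldots, X_n$ be i.i.d. realizations of a r.v. $X$, and $\mathcal{A}$ a
VC-class with VC-dimension $V_{\mathcal{A}}$ and shattering coefficient (or growth function)
$S_{\mathcal{A}}(n)$. Consider the class union $\mathbb{A} = \cup_{A \in \mathcal{A}} A$, and let
$p = \mathbb{P}(X \in \mathbb{A})$. Then there is an absolute constant $C$ such that for all
$0 < \delta < 1$, with probability at least $1 - \delta$,

$$\sup_{A \in \mathcal{A}} \left| \mathbb{P}[X \in A] - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le C \left[ \sqrt{p}\, \sqrt{\frac{V_{\mathcal{A}}}{n} \log\frac{1}{\delta}} + \frac{1}{n} \log\frac{1}{\delta} \right]. \tag{3}$$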


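Proof (sketch of) Details of the proof are deferred to the appendix section. We use a
Bernstein-type concentration inequality (McDiarmid (1998)) that we apply to the general functional

$$f(X_{1:n}) = \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right|,$$

where $X_{1:n}$ denotes the sample $(X_1, \ldots, X_n)$. The inequality in McDiarmid (1998)
involves the variance of the r.v.
$f(X_1, \ldots, X_{k-1}, X_k, x_{k+1}, \ldots, x_n) - f(X_1, \ldots, X_{k-1}, x_k, \ldots, x_n)$,
which can easily be bounded in our setting. We obtain

$$\mathbb{P}\left[ f(X_{1:n}) - \mathbb{E} f(X_{1:n}) \ge t \right] \le e^{-\frac{n t^2}{2q + \frac{2t}{3}}}, \tag{4}$$

where the quantity
$q = \mathbb{E}\left( \sup_{A \in \mathcal{A}} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}| \right)$
(with $X'$ an independent copy of $X$) is a measure of the complexity of the class $\mathcal{A}$
with respect to the distribution of $X$. It leads to high probability bounds on $f(X_{1:n})$ of
the form $\mathbb{E} f(X_{1:n}) + \frac{1}{n}\log(1/\delta) + \sqrt{\frac{q}{n}\log(1/\delta)}$
instead of the standard Hoeffding-type bound
$\mathbb{E} f(X_{1:n}) + \sqrt{\frac{1}{n}\log(1/\delta)}$. It is then easy to see that
$q \le 2 \sup_{A \in \mathcal{A}} \mathbb{P}(X \in A) \le 2p$. Finally, an upper bound on
$\mathbb{E} f(X_{1:n})$ is obtained by introducing re-normalized Rademacher averages

$$\mathcal{R}_{n,p} = \mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{np} \left| \sum_{i=1}^n \sigma_i \mathbf{1}_{X_i \in A} \right|,$$

which are then proved to be of order $O\big(\sqrt{V_{\mathcal{A}}/(pn)}\big)$, so that
$\mathbb{E}(f(X_{1:n})) \le C \sqrt{p}\, \sqrt{V_{\mathcal{A}}/n}$.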



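Remark 2 (Comparison with Existing Bounds) The following re-normalized VC-inequality, due to
Vapnik and Chervonenkis (see Vapnik and Chervonenkis (1974), Anthony and Shawe-Taylor (1993) or
Bousquet et al. (2004), Thm 7),

$$\sup_{A \in \mathcal{A}} \left| \frac{\mathbb{P}(X \in A) - \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \in A}}{\sqrt{\mathbb{P}(X \in A)}} \right| \le 2 \sqrt{\frac{\log S_{\mathcal{A}}(2n) + \log\frac{4}{\delta}}{n}}, \tag{5}$$

which holds under the same conditions as Theorem 1, allows one to derive a bound similar to (3),
but with an additional $\log n$ factor. Indeed, Sauer's Lemma (see Bousquet et al. (2004),
Lemma 1, for instance) states that for $n \ge V_{\mathcal{A}}$,
$S_{\mathcal{A}}(n) \le (en / V_{\mathcal{A}})^{V_{\mathcal{A}}}$. It is then easy to see from (5)
that:

$$\sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le 2 \sqrt{\sup_{A \in \mathcal{A}} \mathbb{P}(X \in A)}\ \sqrt{\frac{V_{\mathcal{A}} \log\frac{2en}{V_{\mathcal{A}}} + \log\frac{4}{\delta}}{n}}.$$

Introduce the union $\mathbb{A}$ of all sets in the considered VC class,
$\mathbb{A} = \cup_{A \in \mathcal{A}} A$, and let $p = \mathbb{P}(X \in \mathbb{A})$. Then, the
previous bound immediately yields

$$\sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le 2 \sqrt{p}\ \sqrt{\frac{V_{\mathcal{A}} \log\frac{2en}{V_{\mathcal{A}}} + \log\frac{4}{\delta}}{n}}.$$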

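Remark 3 (Simpler Bound) If we assume furthermore that $\delta \ge e^{-np}$, then we have:

$$\sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le C \sqrt{p}\, \sqrt{\frac{V_{\mathcal{A}}}{n} \log\frac{1}{\delta}}.$$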


Remark 4 (Interpretation) Inequality (3) can be seen as an interpolation between the best
case (small $p$) where the rate of convergence is $O(1/n)$, and the worst case (large $p$) where
the rate is $O(1/\sqrt{n})$. An alternative interpretation is as follows: divide both sides of (3)
by $p$, so that the left hand side becomes a supremum of conditional probabilities upon belonging
to the union class $\mathbb{A}$,
$\{\mathbb{P}(X \in A \mid X \in \mathbb{A})\}_{A \in \mathcal{A}}$. Then the upper bound is
proportional to $\epsilon(np, \delta)$ where
$\epsilon(n, \delta) := \sqrt{\frac{V_{\mathcal{A}}}{n}\log\frac{1}{\delta}} + \frac{1}{n}\log\frac{1}{\delta}$
is a classical VC-bound; $np$ is in fact the expected number of
observations involved in (3), and can thus be viewed as the effective sample size.
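
The 'effective sample size' reading can be checked numerically. The sketch below evaluates the right-hand side of (3), namely $C[\sqrt{p}\sqrt{(V/n)\log(1/\delta)} + (1/n)\log(1/\delta)]$, and the classical rate $\epsilon(n, \delta) = \sqrt{(V/n)\log(1/\delta)} + (1/n)\log(1/\delta)$. The absolute constant $C$ is not specified in the text, so it is set to 1 here purely for illustration; the function names are hypothetical.

```python
import math


def classical_vc_rate(n, vc_dim, delta):
    """epsilon(n, delta) = sqrt(V/n * log(1/delta)) + (1/n) * log(1/delta)."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(vc_dim / n * log_term) + log_term / n


def low_probability_bound(n, p, vc_dim, delta, constant=1.0):
    """Right-hand side of (3); `constant` stands in for the unspecified C."""
    log_term = math.log(1.0 / delta)
    return constant * (math.sqrt(p) * math.sqrt(vc_dim / n * log_term)
                       + log_term / n)


if __name__ == "__main__":
    n, p, vc_dim, delta = 10_000, 0.01, 3, 0.05
    # Dividing the bound by p gives the classical rate at sample size n * p.
    print(low_probability_bound(n, p, vc_dim, delta) / p)
    print(classical_vc_rate(n * p, vc_dim, delta))
```

Both printed quantities coincide (up to rounding), which is exactly the statement that the bound divided by $p$ behaves like a classical VC bound computed on $np$ observations.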


Remark 5 (Classification of Extremes) A key issue in the prediction framework is to find
upper bounds for the maximal deviation $\sup_{g \in \mathcal{G}} |L_n(g) - L(g)|$, where
$L(g) = \mathbb{P}(g(X) \ne Y)$ is the risk of the classifier $g : \mathcal{X} \to \{-1, 1\}$,
associated with the r.v. $(X, Y) \in \mathbb{R}^d \times \{-1, 1\}$, and
$L_n(g) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{g(X_i) \ne Y_i\}}$ is the empirical risk based on a
training dataset $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. Strong upper bounds on
$\sup_{g \in \mathcal{G}} |L_n(g) - L(g)|$ ensure the accuracy of the empirical risk minimizer
$g_n := \arg\min_{g \in \mathcal{G}} L_n(g)$.

In a wide variety of applications (e.g. Finance, Insurance, Networks), it is of crucial importance
to predict the system response $Y$ when the input variable $X$ takes extreme values, corresponding
to shocks on the underlying mechanism. In such a case, the risk of a prediction rule $g(X)$ should
be defined by integrating the loss function $L(g)$ with respect to the conditional joint
distribution of the pair $(X, Y)$ given $X$ is extreme. For instance, consider the event
$\{\|X\| > t_\alpha\}$ where $t_\alpha$ is the $(1 - \alpha)$th quantile of $\|X\|$ for a small
$\alpha$. To investigate the accuracy of a classifier $g$ given $\{\|X\| > t_\alpha\}$, introduce

$$L_\alpha(g) := \frac{1}{\alpha}\, \mathbb{P}\left( Y \ne g(X),\ \|X\| > t_\alpha \right) = \mathbb{P}\left( Y \ne g(X) \,\middle|\, \|X\| > t_\alpha \right),$$

and its empirical counterpart

$$L_{\alpha,n}(g) := \frac{1}{n\alpha} \sum_{i=1}^n \mathbf{1}_{\{Y_i \ne g(X_i),\ \|X_i\| \ge \|X_{(\lfloor n\alpha \rfloor)}\|\}},$$

where $\|X_{(1)}\| \ge \ldots \ge \|X_{(n)}\|$ are the order statistics of $\|X\|$. Then, as an
application of Theorem 1 with
$\mathcal{A} = \{\{(x, y) : g(x) \ne y,\ \|x\| > t_\alpha\},\ g \in \mathcal{G}\}$, we have:

$$\sup_{g \in \mathcal{G}} \left| L_{\alpha,n}(g) - L_\alpha(g) \right| \le C \left[ \sqrt{\frac{V_{\mathcal{G}}}{n\alpha} \log\frac{1}{\delta}} + \frac{1}{n\alpha} \log\frac{1}{\delta} \right]. \tag{6}$$

We refer to the appendix for more details. Again the rate obtained by empirical risk minimization
meets our expectations (see Remark 4), insofar as $\alpha$ is the fraction of the dataset involved
in the empirical risk $L_{\alpha,n}$. We point out that $\alpha$ may typically depend on $n$,
$\alpha = \alpha_n \to 0$. In this context a direct use of the standard version of the VC
inequality would lead to a rate of order $1/(\alpha_n \sqrt{n})$, which may not vanish as
$n \to +\infty$ and even go to infinity if $\alpha_n$ decays to 0 faster than $1/\sqrt{n}$.

Let us point out that rare events may be chosen more general than $\{\|X\| > t_\alpha\}$, say
$\{X \in Q\}$ with unknown probability $q = \mathbb{P}(\{X \in Q\})$. The previous result still
applies with $L_Q(g) := \mathbb{P}(Y \ne g(X),\ X \in Q)$ and
$L_{Q,n}(g) := \mathbb{P}_n(Y \ne g(X),\ X \in Q)$, where $\mathbb{P}_n$ denotes the empirical
measure; then the obtained upper bound on
$\sup_{g \in \mathcal{G}} \frac{1}{q} |L_Q(g) - L_{Q,n}(g)|$ is of order $O(1/\sqrt{qn})$.

Similar results can be established for the problem of distribution-free regression, when the error
of any predictive rule $f(x)$ is measured by the conditional mean squared error
$\mathbb{E}[(Z - f(X))^2 \mid Z > q_\alpha]$, denoting by $Z$ the real-valued output variable to be
predicted from $X$ and by $q_\alpha$ its
quantile at level $1 - \alpha$.
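
The empirical conditional risk $L_{\alpha,n}(g) = \frac{1}{n\alpha}\sum_i \mathbf{1}\{Y_i \ne g(X_i),\ \|X_i\| \ge \|X_{(\lfloor n\alpha \rfloor)}\|\}$ of Remark 5 can be sketched as follows. The text does not fix a norm, so the Euclidean norm is assumed here; the function name and the toy classifier are hypothetical.

```python
import math


def empirical_tail_risk(inputs, labels, classifier, alpha):
    """Error count among the floor(n * alpha) largest-norm inputs, over n * alpha.

    `inputs` is a sequence of coordinate tuples, `labels` takes values in
    {-1, +1}, and `classifier` maps an input tuple to -1 or +1.
    The Euclidean norm is an assumption of this sketch.
    """
    n = len(inputs)
    m = math.floor(n * alpha)
    if m < 1:
        raise ValueError("n * alpha must be at least 1")
    norms = [math.sqrt(sum(c * c for c in x)) for x in inputs]
    # ||X_(m)|| with norms sorted in decreasing order
    threshold = sorted(norms, reverse=True)[m - 1]
    errors = sum(
        1
        for x, y, r in zip(inputs, labels, norms)
        if r >= threshold and classifier(x) != y
    )
    return errors / (n * alpha)


if __name__ == "__main__":
    xs = [(float(i),) for i in range(1, 9)]
    ys = [1, -1, 1, 1, 1, 1, 1, -1]
    always_plus = lambda x: 1
    # Only the two largest-norm points (7 and 8) enter the tail risk.
    print(empirical_tail_risk(xs, ys, always_plus, alpha=0.25))
```

Errors made on the bulk of the data (here the point with input 2) do not contribute, which is the point of conditioning on the extreme region.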


4. A bound on the STDF 

Let us place ourselves in the multivariate extreme framework introduced in Section 2: consider a
random variable $X = (X^1, \ldots, X^d)$ in $\mathbb{R}^d$ with distribution function $F$ and
marginal distribution functions $F_1, \ldots, F_d$. Let $X_1, X_2, \ldots, X_n$ be an i.i.d.
sample distributed as $X$. In the subsequent analysis, the only assumption is the existence of the
STDF defined in (2), and the margins $F_j$ are supposed to be unknown. The definition of $l$ may be
recast as

$$l(x) := \lim_{t \to 0} t^{-1} \tilde{F}(t x) \tag{7}$$

with $\tilde{F}(x) = (1 - F)\big( (1 - F_1)^{\leftarrow}(x_1), \ldots, (1 - F_d)^{\leftarrow}(x_d) \big)$.
Here the notation $(1 - F_j)^{\leftarrow}(x_j)$ denotes the quantity
$\sup\{y : 1 - F_j(y) \ge x_j\}$. Notice that, in terms of standardized variables $U^j$,

$$\tilde{F}(x) = \mathbb{P}\Big( \bigcup_{j=1}^d \{U^j \le x_j\} \Big) = \mathbb{P}(U \in [x, \infty[^c).$$

Let $k = k(n)$ be a sequence of positive integers such that $k \to \infty$ and $k = o(n)$ as
$n \to \infty$. A natural estimator of $l$ is its empirical version defined as follows, see Huang
(1992), Qi (1997), Drees and Huang (1998), Einmahl et al. (2006):

$$l_n(x) = \frac{1}{k} \sum_{i=1}^n \mathbf{1}_{\{X_i^1 \ge X^1_{(n - \lfloor k x_1 \rfloor + 1)} \text{ or } \ldots \text{ or } X_i^d \ge X^d_{(n - \lfloor k x_d \rfloor + 1)}\}}. \tag{8}$$


Learning rates for the dependence structure of rare events 


The expression is indeed suggested by the definition of $l$ in (7), with all distribution
functions and univariate quantiles replaced by their empirical counterparts, and with $t$ replaced
by $k/n$. Extensive studies have proved consistency and asymptotic normality of this nonparametric
estimator of $l$, see Huang (1992), Drees and Huang (1998) and de Haan and Ferreira (2006) for the
asymptotic normality in dimension 2, Qi (1997) for consistency in arbitrary dimension, and Einmahl
et al. (2012) for asymptotic normality in arbitrary dimension under differentiability conditions on
$l$.

To the best of our knowledge, there is no established non-asymptotic bound on the maximal deviation
$\sup_{0 \le x \le T} |l_n(x) - l(x)|$. It is the purpose of the remainder of this section to
derive such a bound,
without any smoothness condition on $l$.
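
For concreteness, the empirical STDF $l_n(x) = \frac{1}{k}\sum_{i=1}^n \mathbf{1}\{X_i^1 \ge X^1_{(n - \lfloor k x_1 \rfloor + 1)} \text{ or } \ldots \text{ or } X_i^d \ge X^d_{(n - \lfloor k x_d \rfloor + 1)}\}$ can be sketched as below. Two conventions are assumptions of this sketch rather than statements from the text: a coordinate with $\lfloor k x_j \rfloor = 0$ never triggers (threshold $+\infty$), and $\lfloor k x_j \rfloor$ is capped at $n$.

```python
import math


def empirical_stdf(sample, x, k):
    """Empirical stable tail dependence function at the point x.

    `sample` is a list of d-dimensional tuples, `x` a d-dimensional tuple of
    non-negative numbers and `k` the number of retained extremes.
    """
    n = len(sample)
    d = len(x)
    thresholds = []
    for j in range(d):
        m = min(math.floor(k * x[j]), n)
        if m <= 0:
            thresholds.append(math.inf)  # assumed convention: never exceeded
        else:
            column = sorted(row[j] for row in sample)
            thresholds.append(column[n - m])  # order statistic X_(n - m + 1)
    hits = sum(
        1 for row in sample if any(row[j] >= thresholds[j] for j in range(d))
    )
    return hits / k


if __name__ == "__main__":
    n, k = 100, 10
    comonotone = [(i, i) for i in range(1, n + 1)]
    countermonotone = [(i, n + 1 - i) for i in range(1, n + 1)]
    # Perfect tail dependence gives max(x); disjoint extremes give sum(x).
    print(empirical_stdf(comonotone, (1.0, 1.0), k))
    print(empirical_stdf(countermonotone, (1.0, 1.0), k))
```

The two toy samples reproduce the extreme cases of the inequality $\max_j x_j \le l(x) \le \sum_j x_j$ recalled in Section 2.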

First, Theorem 1 needs adaptation to a particular setting: introduce a random vector
$Z = (Z^1, \ldots, Z^d)$ with uniform margins, i.e., for every $j = 1, \ldots, d$, the variable
$Z^j$ is uniform on $[0, 1]$. Consider the class

$$\mathcal{A} = \left\{ \Big[ \frac{k}{n} x, \infty \Big[^c \ :\ x \in \mathbb{R}^d_+,\ 0 \le x_j \le T\ (1 \le j \le d) \right\}.$$
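This is a VC-class of VC-dimension $d$, as proved in Devroye et al. (1996), Theorem 13.8, for its
complementary class $\{[x, \infty[,\ x > 0\}$. In this context, the union class $\mathbb{A}$ has
mass $p \le dT\frac{k}{n}$ since

$$\mathbb{P}(Z \in \mathbb{A}) = \mathbb{P}\left[ Z \in \Big( \Big[ \frac{k}{n} T, \infty \Big[^d \Big)^c \right] = \mathbb{P}\left[ \bigcup_{j=1..d} \Big\{ Z^j < \frac{k}{n} T \Big\} \right] \le \sum_{j=1}^d \mathbb{P}\left[ Z^j < \frac{k}{n} T \right] \le d\, \frac{k}{n}\, T.$$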

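Consider the measures $C_n(\cdot) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{Z_i \in \cdot\}}$ and
$C(\cdot) = \mathbb{P}(Z \in \cdot)$. As a direct consequence of Theorem 1 (applied with
$V_{\mathcal{A}} = d$ and $p \le dT\frac{k}{n}$), the following inequality holds true with
probability at least $1 - \delta$,

$$\sup_{0 \le x \le T} \frac{n}{k} \left| C_n\Big( \frac{k}{n} [x, \infty[^c \Big) - C\Big( \frac{k}{n} [x, \infty[^c \Big) \right| \le C d \left( \sqrt{\frac{T}{k} \log\frac{1}{\delta}} + \frac{1}{k} \log\frac{1}{\delta} \right).$$

If we assume furthermore that $\delta \ge e^{-k}$, then we have

$$\sup_{0 \le x \le T} \frac{n}{k} \left| C_n\Big( \frac{k}{n} [x, \infty[^c \Big) - C\Big( \frac{k}{n} [x, \infty[^c \Big) \right| \le C d \sqrt{\frac{T}{k} \log\frac{1}{\delta}}. \tag{9}$$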


Inequality (9) is the cornerstone of the following theorem, which is the main result of the paper.
In the sequel, we consider a sequence $k(n)$ of integers such that $k = o(n)$ and
$k(n) \to \infty$. For notational convenience, we often drop the dependence in $n$ and simply write
$k$ instead of $k(n)$.

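Theorem 6 Let $T$ be a positive number such that $T \ge \frac{7}{2}\left( \frac{\log d}{k} + 1 \right)$,
and $\delta$ such that $\delta \ge e^{-k}$. Then there is an absolute constant $C$ such that for
each $n > 0$, with probability at least $1 - \delta$:

$$\sup_{0 \le x \le T} |l_n(x) - l(x)| \le C d \sqrt{\frac{T}{k} \log\frac{d + 3}{\delta}} + \sup_{0 \le x \le 2T} \left| \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) - l(x) \right|. \tag{10}$$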


The second term on the right hand side of (10) is a bias term which depends on the discrepancy
between the left hand side and the limit in (2) or (7) at level $t = k/n$. The value $k$ can be
interpreted as the effective number of observations used in the empirical estimate, i.e. the
effective sample size for tail estimation. Considering classical inequalities in empirical process
theory such as VC-bounds, it is thus no surprise to obtain one in $O(1/\sqrt{k})$. Too large values
of $k$ tend to yield a large bias, whereas too small values of $k$ yield a large variance. For a
more detailed discussion on the choice of $k$ we recommend Einmahl et al. (2009).

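The proof of Theorem 6 follows the same lines as in Qi (1997). For unidimensional random
variables $Y_1, \ldots, Y_n$, let us denote by $Y_{(1)} \le \ldots \le Y_{(n)}$ their order
statistics. Define then the empirical version $\tilde{F}_n$ of $\tilde{F}$ (introduced in (7)) as

$$\tilde{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{U_i^1 \le x_1 \text{ or } \ldots \text{ or } U_i^d \le x_d\}},$$

so that
$\frac{n}{k} \tilde{F}_n(\frac{k}{n} x) = \frac{1}{k} \sum_{i=1}^n \mathbf{1}_{\{U_i^1 \le \frac{k}{n} x_1 \text{ or } \ldots \text{ or } U_i^d \le \frac{k}{n} x_d\}}$.
Notice that the $U_i^j$'s are not observable (since $F_j$ is unknown). In fact, $\tilde{F}_n$ will
be used as a substitute for $l_n$, allowing us to handle uniform variables. The following lemmas
make this point explicit.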


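Lemma 7 (Link between $l_n$ and $\tilde{F}_n$) The empirical version of $\tilde{F}$ and that of
$l$ are related via

$$l_n(x) = \frac{n}{k} \tilde{F}_n\big( U^1_{(\lfloor k x_1 \rfloor)}, \ldots, U^d_{(\lfloor k x_d \rfloor)} \big).$$

Proof Consider the definition of $l_n$ in (8), and note that for $j = 1, \ldots, d$,

$$
\begin{aligned}
X_i^j \ge X^j_{(n - \lfloor k x_j \rfloor + 1)}
&\Leftrightarrow \mathrm{rank}(X_i^j) \ge n - \lfloor k x_j \rfloor + 1 \\
&\Leftrightarrow \mathrm{rank}(F_j(X_i^j)) \ge n - \lfloor k x_j \rfloor + 1 \\
&\Leftrightarrow \mathrm{rank}(1 - F_j(X_i^j)) \le \lfloor k x_j \rfloor \\
&\Leftrightarrow U_i^j \le U^j_{(\lfloor k x_j \rfloor)},
\end{aligned}
$$

so that
$l_n(x) = \frac{1}{k} \sum_{i=1}^n \mathbf{1}_{\{U_i^1 \le U^1_{(\lfloor k x_1 \rfloor)} \text{ or } \ldots \text{ or } U_i^d \le U^d_{(\lfloor k x_d \rfloor)}\}}$.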
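
The rank argument of Lemma 7 can be checked on simulated data. The sketch below draws a sample with standard exponential margins (an assumption made only so that $1 - F_j(x) = e^{-x}$ is known in closed form), computes $l_n$ once from the raw $X_i$ via (8) and once from the standardized $U_i^j = e^{-X_i^j}$ via the formula of the lemma, and compares the two. Function names are hypothetical, and the sketch assumes $1 \le \lfloor k x_j \rfloor \le n$ and no ties.

```python
import math
import random


def stdf_from_raw(sample, x, k):
    """l_n(x) from definition (8), using upper order statistics of X."""
    n = len(sample)
    thresholds = []
    for j, xj in enumerate(x):
        m = math.floor(k * xj)
        column = sorted(row[j] for row in sample)
        thresholds.append(column[n - m])  # X_(n - m + 1)
    return sum(
        any(row[j] >= thresholds[j] for j in range(len(x))) for row in sample
    ) / k


def stdf_from_uniform(u_sample, x, k):
    """l_n(x) via Lemma 7, using lower order statistics of U = 1 - F(X)."""
    thresholds = []
    for j, xj in enumerate(x):
        m = math.floor(k * xj)
        column = sorted(row[j] for row in u_sample)
        thresholds.append(column[m - 1])  # U_(m)
    return sum(
        any(row[j] <= thresholds[j] for j in range(len(x))) for row in u_sample
    ) / k


def simulate(n, seed=0):
    """Independent standard exponential margins (illustrative choice)."""
    rng = random.Random(seed)
    raw = [(rng.expovariate(1.0), rng.expovariate(1.0)) for _ in range(n)]
    uniform = [(math.exp(-a), math.exp(-b)) for a, b in raw]
    return raw, uniform


if __name__ == "__main__":
    raw, uniform = simulate(200)
    for point in [(1.0, 1.0), (1.0, 0.5), (0.25, 2.0)]:
        print(point, stdf_from_raw(raw, point, 20),
              stdf_from_uniform(uniform, point, 20))
```

Because the estimator only depends on the ranks within each margin, any strictly monotone transformation of the margins leaves it unchanged, which is why the unknown $F_j$ never need to be estimated.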


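Lemma 8 (Uniform bound on $\tilde{F}_n$'s deviations) For any finite $T > 0$, and
$\delta \ge e^{-k}$, with probability at least $1 - \delta$, the deviation of $\tilde{F}_n$ from
$\tilde{F}$ is uniformly bounded:

$$\sup_{0 \le x \le T} \left| \frac{n}{k} \tilde{F}_n\Big( \frac{k}{n} x \Big) - \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) \right| \le C d \sqrt{\frac{T}{k} \log\frac{1}{\delta}}.$$

Proof Notice that

$$\sup_{0 \le x \le T} \left| \frac{n}{k} \tilde{F}_n\Big( \frac{k}{n} x \Big) - \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) \right| = \frac{n}{k} \sup_{0 \le x \le T} \left| \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{U_i \in \frac{k}{n} ]x, \infty]^c\}} - \mathbb{P}\Big[ U \in \frac{k}{n} ]x, \infty]^c \Big] \right|,$$

and apply inequality (9).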
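
Lemma 9 (Bound on the order statistics of $U$) Let $\delta \ge e^{-k}$. For any finite positive
number $T > 0$ such that $T \ge \frac{7}{2}\left( \frac{\log d}{k} + 1 \right)$, we have with
probability greater than $1 - \delta$,

$$\forall\, 1 \le j \le d, \qquad \frac{n}{k} U^j_{(\lfloor kT \rfloor)} \le 2T, \tag{11}$$

and with probability greater than $1 - (d + 1)\delta$,

$$\max_{1 \le j \le d}\ \sup_{0 \le x_j \le T} \left| \frac{\lfloor k x_j \rfloor}{k} - \frac{n}{k} U^j_{(\lfloor k x_j \rfloor)} \right| \le C \sqrt{\frac{T}{k} \log\frac{1}{\delta}}.$$

Proof Notice that
$\sup_{[0, T]} \frac{n}{k} U^j_{(\lfloor k \cdot \rfloor)} = \frac{n}{k} U^j_{(\lfloor kT \rfloor)}$
and let $\Gamma_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{U_i^j \le t\}}$. It is then
straightforward to see that

$$\frac{n}{k} U^j_{(\lfloor kT \rfloor)} \le 2T \iff \Gamma_n\Big( \frac{k}{n} 2T \Big) \ge \frac{\lfloor kT \rfloor}{n},$$

so that

$$\mathbb{P}\left( \frac{n}{k} U^j_{(\lfloor kT \rfloor)} > 2T \right) \le \mathbb{P}\left( \sup_{\frac{2kT}{n} \le t \le 1} \frac{t}{\Gamma_n(t)} > 2 \right).$$

Using Wellner (1978), Lemma 1-(ii) (we use the fact that, with the notations of this reference,
$h(1/2) \ge 1/7$), we obtain

$$\mathbb{P}\left( \frac{n}{k} U^j_{(\lfloor kT \rfloor)} > 2T \right) \le e^{-\frac{2kT}{7}},$$

and thus

$$\mathbb{P}\left( \exists j,\ \frac{n}{k} U^j_{(\lfloor kT \rfloor)} > 2T \right) \le d\, e^{-\frac{2kT}{7}} \le e^{-k} \le \delta,$$

as required in (11). Yet,

$$
\begin{aligned}
\sup_{0 \le x_j \le T} \left| \frac{\lfloor k x_j \rfloor}{k} - \frac{n}{k} U^j_{(\lfloor k x_j \rfloor)} \right|
&= \sup_{0 \le x_j \le T} \left| \frac{1}{k} \sum_{i=1}^n \mathbf{1}_{\{U_i^j \le U^j_{(\lfloor k x_j \rfloor)}\}} - \frac{n}{k} U^j_{(\lfloor k x_j \rfloor)} \right| \\
&= \frac{n}{k} \sup_{0 \le x_j \le T} \left| \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{U_i^j \le U^j_{(\lfloor k x_j \rfloor)}\}} - \mathbb{P}\Big[ U_1^j \le U^j_{(\lfloor k x_j \rfloor)} \Big] \right| \\
&= \sup_{0 \le x_j \le T} \Theta_j\Big( \frac{n}{k} U^j_{(\lfloor k x_j \rfloor)} \Big),
\end{aligned}
$$

where
$\Theta_j(y) = \frac{n}{k} \left| \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{U_i^j \le \frac{k}{n} y\}} - \mathbb{P}\big[ U_1^j \le \frac{k}{n} y \big] \right|$
(in the second line the probability is understood with the order statistic held fixed). Then, by
(11), with probability greater than $1 - \delta$,

$$\max_{1 \le j \le d}\ \sup_{0 \le x_j \le T} \left| \frac{\lfloor k x_j \rfloor}{k} - \frac{n}{k} U^j_{(\lfloor k x_j \rfloor)} \right| \le \max_{1 \le j \le d}\ \sup_{0 \le y \le 2T} \Theta_j(y),$$

and from (9), each term $\sup_{0 \le y \le 2T} \Theta_j(y)$ is bounded by
$C \sqrt{\frac{T}{k} \log\frac{1}{\delta}}$ (with probability $1 - \delta$). In the end, with
probability greater than $1 - (d + 1)\delta$:

$$\max_{1 \le j \le d}\ \sup_{0 \le y \le 2T} \Theta_j(y) \le C \sqrt{\frac{T}{k} \log\frac{1}{\delta}},$$

which is the desired inequality.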


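We may now proceed with the proof of Theorem 6. First of all, noticing that $\tilde{F}(tx)$ is
non-decreasing in $x_j$ for every $j$ and that $l(x)$ is non-decreasing and continuous (thus
uniformly continuous on $[0, T]^d$), from (7) it is easy to prove by subdividing $[0, T]^d$ (see
Qi (1997) p. 174 for details) that

$$\sup_{0 \le x \le T} \left| \frac{1}{t} \tilde{F}(t x) - l(x) \right| \to 0 \quad \text{as } t \to 0. \tag{12}$$

Using Lemma 7, we can write:

$$
\begin{aligned}
\sup_{0 \le x \le T} |l_n(x) - l(x)|
&= \sup_{0 \le x \le T} \left| \frac{n}{k} \tilde{F}_n\big( U^1_{(\lfloor k x_1 \rfloor)}, \ldots, U^d_{(\lfloor k x_d \rfloor)} \big) - l(x) \right| \\
&\le \sup_{0 \le x \le T} \left| \frac{n}{k} \tilde{F}_n\big( U^1_{(\lfloor k x_1 \rfloor)}, \ldots, U^d_{(\lfloor k x_d \rfloor)} \big) - \frac{n}{k} \tilde{F}\big( U^1_{(\lfloor k x_1 \rfloor)}, \ldots, U^d_{(\lfloor k x_d \rfloor)} \big) \right| \\
&\quad + \sup_{0 \le x \le T} \left| \frac{n}{k} \tilde{F}\big( U^1_{(\lfloor k x_1 \rfloor)}, \ldots, U^d_{(\lfloor k x_d \rfloor)} \big) - l\Big( \frac{n}{k} U^1_{(\lfloor k x_1 \rfloor)}, \ldots, \frac{n}{k} U^d_{(\lfloor k x_d \rfloor)} \Big) \right| \\
&\quad + \sup_{0 \le x \le T} \left| l\Big( \frac{n}{k} U^1_{(\lfloor k x_1 \rfloor)}, \ldots, \frac{n}{k} U^d_{(\lfloor k x_d \rfloor)} \Big) - l(x) \right| \\
&=: \Lambda(n) + \Xi(n) + \Upsilon(n).
\end{aligned}
$$

Now, by (11) we have with probability greater than $1 - \delta$:

$$\Lambda(n) \le \sup_{0 \le x \le 2T} \left| \frac{n}{k} \tilde{F}_n\Big( \frac{k}{n} x \Big) - \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) \right|,$$

and by Lemma 8,

$$\Lambda(n) \le C d \sqrt{\frac{2T}{k} \log\frac{1}{\delta}}$$

with probability at least $1 - 2\delta$. Similarly,

$$\Xi(n) \le \sup_{0 \le x \le 2T} \left| \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) - l(x) \right| \to 0 \quad \text{(bias term)}$$

by virtue of (12). Concerning $\Upsilon(n)$, we have:

$$
\begin{aligned}
\Upsilon(n)
&\le \sup_{0 \le x \le T} \left| l\Big( \frac{n}{k} U^1_{(\lfloor k x_1 \rfloor)}, \ldots, \frac{n}{k} U^d_{(\lfloor k x_d \rfloor)} \Big) - l\Big( \frac{\lfloor k x_1 \rfloor}{k}, \ldots, \frac{\lfloor k x_d \rfloor}{k} \Big) \right| \\
&\quad + \sup_{0 \le x \le T} \left| l\Big( \frac{\lfloor k x_1 \rfloor}{k}, \ldots, \frac{\lfloor k x_d \rfloor}{k} \Big) - l(x) \right| \\
&= \Upsilon_1(n) + \Upsilon_2(n).
\end{aligned}
$$

Recall that $l$ is 1-Lipschitz on $[0, T]^d$ with respect to the $\|\cdot\|_1$-norm, so that

$$\Upsilon_1(n) \le \sup_{0 \le x \le T} \sum_{j=1}^d \left| \frac{\lfloor k x_j \rfloor}{k} - \frac{n}{k} U^j_{(\lfloor k x_j \rfloor)} \right|,$$

so that by Lemma 9, with probability greater than $1 - (d + 1)\delta$:

$$\Upsilon_1(n) \le C d \sqrt{\frac{2T}{k} \log\frac{1}{\delta}}.$$

On the other hand,
$\Upsilon_2(n) \le \sup_{0 \le x \le T} \sum_{j=1}^d \left| \frac{\lfloor k x_j \rfloor}{k} - x_j \right| \le \frac{d}{k}$.
Finally we get, for every $n > 0$, with probability at least $1 - (d + 3)\delta$:

$$
\begin{aligned}
\sup_{0 \le x \le T} |l_n(x) - l(x)|
&\le \Lambda(n) + \Upsilon_1(n) + \Upsilon_2(n) + \Xi(n) \\
&\le C d \sqrt{\frac{2T}{k} \log\frac{1}{\delta}} + C d \sqrt{\frac{2T}{k} \log\frac{1}{\delta}} + \frac{d}{k} + \sup_{0 \le x \le 2T} \left| \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) - l(x) \right| \\
&\le C' d \sqrt{\frac{2T}{k} \log\frac{1}{\delta}} + \sup_{0 \le x \le 2T} \left| \frac{n}{k} \tilde{F}\Big( \frac{k}{n} x \Big) - l(x) \right|.
\end{aligned}
$$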

5. Discussion 

We provide a non-asymptotic bound of VC type controlling the error of the empirical version of
the STDF. Our bound achieves the expected rate in $O(k^{-1/2}) + \mathrm{bias}(k)$, where $k$ is
the number of (extreme) observations retained in the learning process. In practice the smaller
$k/n$, the smaller the bias. Since no assumption is made on the underlying distribution, other than
the existence of the STDF, it is not possible in our framework to control the bias explicitly. One
option would be to make an additional hypothesis of 'second order regular variation' (see e.g. de
Haan and Resnick, 1996). We made the choice of making as few assumptions as possible. However,
since the bias term is separated from the 'variance' term, it is probably feasible to refine our
result with more assumptions.

For the purpose of controlling the empirical STDF, we have adopted the more general framework
of maximal deviations in low probability regions. The VC-type bounds adapted to low probability
regions derived in Section 3 may directly be applied to a particular prediction context, namely
where the objective is to learn a classifier (or a regressor) that has good properties on low
probability regions. This may open the road to the study of classification of extremal
observations, with immediate applications to the field of anomaly detection.
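
Appendix A. Proof of Theorem 1


Theorem 1 is actually a short version of Theorem 10 below:

Theorem 10 (Maximal deviations) Let $X_1, \ldots, X_n$ be i.i.d. realizations of a r.v. $X$ valued
in $\mathbb{R}^d$, $\mathcal{A}$ a VC-class, and denote by $\mathcal{R}_{n,p}$ the associated
relative Rademacher average defined by

$$\mathcal{R}_{n,p} = \mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{np} \left| \sum_{i=1}^n \sigma_i \mathbf{1}_{X_i \in A} \right|. \tag{13}$$

Define the union $\mathbb{A} = \cup_{A \in \mathcal{A}} A$ and $p = \mathbb{P}(X \in \mathbb{A})$.
Fix $0 < \delta < 1$, then with probability at least $1 - \delta$,

$$\frac{1}{p} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le 2 \mathcal{R}_{n,p} + \frac{2}{3np} \log\frac{1}{\delta} + 2 \sqrt{\frac{1}{np} \log\frac{1}{\delta}},$$

and there is a constant $C$ independent of $n, p, \delta$ such that with probability greater than
$1 - \delta$,

$$\sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le C \left( \sqrt{p}\, \sqrt{\frac{V_{\mathcal{A}}}{n} \log\frac{1}{\delta}} + \frac{1}{n} \log\frac{1}{\delta} \right).$$

If we assume furthermore that $\delta \ge e^{-np}$, then we both have:

$$\frac{1}{p} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le 2 \mathcal{R}_{n,p} + 3 \sqrt{\frac{1}{np} \log\frac{1}{\delta}},$$

$$\frac{1}{p} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \le C \sqrt{\frac{V_{\mathcal{A}}}{np} \log\frac{1}{\delta}}.$$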
< C 

In the following, $X_{1:n}$ denotes an i.i.d. sample $(X_1, \ldots, X_n)$ distributed as $X$, an
$\mathbb{R}^d$-valued random vector. The classical steps to prove VC inequalities consist in
applying a concentration inequality to the function

$$f(X_{1:n}) := \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right|, \tag{14}$$

and then establishing bounds on the expectation $\mathbb{E} f(X_{1:n})$, using for instance
Rademacher averages. Here we follow the same lines, but apply a Bernstein-type concentration
inequality instead of the usual Hoeffding one, since the variance term in the bound involves the
probability $p$ to be in the union of the VC-class $\mathcal{A}$ considered. We then introduce
relative Rademacher averages instead of the conventional ones, to take $p$ into account when
bounding $\mathbb{E} f(X_{1:n})$.

We first need to control the variability of the random variable $f(X_{1:n})$ when fixing all but
one marginal $X_i$. For that purpose introduce the functional

$$h(x_1, \ldots, x_k) = \mathbb{E}\left[ f(X_{1:n}) \mid X_1 = x_1, \ldots, X_k = x_k \right] - \mathbb{E}\left[ f(X_{1:n}) \mid X_1 = x_1, \ldots, X_{k-1} = x_{k-1} \right].$$

The positive deviation of $h(x_1, \ldots, x_{k-1}, X_k)$ is defined by

$$\mathrm{dev}^+(x_1, \ldots, x_{k-1}) = \sup_{x \in \mathbb{R}^d} \{ h(x_1, \ldots, x_{k-1}, x) \},$$

and $\mathrm{maxdev}^+$, the maximum of all positive deviations, by

$$\mathrm{maxdev}^+ = \sup_{x_1, \ldots, x_{k-1}}\ \max_k\ \mathrm{dev}^+(x_1, \ldots, x_{k-1}).$$

Finally, define $v$, the maximum sum of variances, by

$$v = \sup_{x_1, \ldots, x_n} \sum_{k=1}^n \mathrm{Var}\ h(x_1, \ldots, x_{k-1}, X_k).$$

We now have the tools to state an extension of the classical Bernstein inequality, which is proved
in McDiarmid (1998).

Proposition 11 Let $X_{1:n} = (X_1, \ldots, X_n)$ be as above, and $f$ any function
$(\mathbb{R}^d)^n \to \mathbb{R}$. Let $\mathrm{maxdev}^+$ be the maximum of all positive
deviations and $v$ the maximum sum of variances, both of which we assume to be finite, and let
$\mu$ be the mean of $f(X_{1:n})$. Then for any $t \ge 0$,

$$\mathbb{P}\left[ f(X_{1:n}) - \mu \ge t \right] \le \exp\left( - \frac{t^2}{2v \left( 1 + \frac{\mathrm{maxdev}^+\, t}{3v} \right)} \right).$$

Note that the term $\frac{\mathrm{maxdev}^+\, t}{3v}$ is viewed as an 'error term' and is often
negligible. Let us apply this theorem to the specific function $f$ defined in (14). Then the
following lemma holds true:


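Lemma 12 In the situation of Proposition 11 with $f$ as in (14), we have

$$\mathrm{maxdev}^+ \le \frac{1}{n} \quad \text{and} \quad v \le \frac{q}{n},$$

where

$$q = \mathbb{E}\left( \sup_{A \in \mathcal{A}} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}| \right) \le 2\, \mathbb{E}\left( \sup_{A \in \mathcal{A}} \mathbf{1}_{X' \in A} \mathbf{1}_{X \notin A} \right), \tag{15}$$

with $X'$ an independent copy of $X$.

Proof Considering the definition of $f$, we have:

$$
\begin{aligned}
h(x_1, \ldots, x_{k-1}, x_k)
&= \mathbb{E} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^{k} \mathbf{1}_{x_i \in A} - \frac{1}{n} \sum_{i=k+1}^{n} \mathbf{1}_{X_i \in A} \right| \\
&\quad - \mathbb{E} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^{k-1} \mathbf{1}_{x_i \in A} - \frac{1}{n} \sum_{i=k}^{n} \mathbf{1}_{X_i \in A} \right|.
\end{aligned}
$$

Using the fact that
$\left| \sup_{A \in \mathcal{A}} |F(A)| - \sup_{A \in \mathcal{A}} |G(A)| \right| \le \sup_{A \in \mathcal{A}} |F(A) - G(A)|$
for all functions $F$ and $G$ of $A$, we obtain:

$$|h(x_1, \ldots, x_{k-1}, x_k)| \le \mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{n} \left| \mathbf{1}_{x_k \in A} - \mathbf{1}_{X_k \in A} \right|. \tag{16}$$

The term on the right hand side of (16) is less than $\frac{1}{n}$, so that
$\mathrm{maxdev}^+ \le \frac{1}{n}$. Moreover, if $X'$ is an independent copy of $X$, (16) yields

$$|h(x_1, \ldots, x_{k-1}, X')| \le \mathbb{E}\left[ \sup_{A \in \mathcal{A}} \frac{1}{n} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}| \ \Big|\ X' \right],$$

so that, by Jensen's inequality and since the indicator differences take values in $\{0, 1\}$ in
absolute value,

$$
\begin{aligned}
\mathbb{E}\left[ h(x_1, \ldots, x_{k-1}, X')^2 \right]
&\le \mathbb{E}\left[ \mathbb{E}\left[ \sup_{A \in \mathcal{A}} \frac{1}{n} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}| \ \Big|\ X' \right]^2 \right] \\
&\le \mathbb{E}\left[ \sup_{A \in \mathcal{A}} \frac{1}{n^2} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}|^2 \right] \\
&\le \frac{1}{n^2}\, \mathbb{E}\left[ \sup_{A \in \mathcal{A}} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}| \right].
\end{aligned}
$$

Thus
$\mathrm{Var}(h(x_1, \ldots, x_{k-1}, X_k)) \le \mathbb{E}[h(x_1, \ldots, x_{k-1}, X_k)^2] \le \frac{q}{n^2}$.
Finally $v \le \frac{q}{n}$ as required. ■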


As a consequence of Lemma 12 and Proposition 11, the following general inequality holds true:

$$\mathbb{P}\left[ f(X_{1:n}) - \mathbb{E} f(X_{1:n}) \ge t \right] \le e^{-\frac{n t^2}{2q + \frac{2t}{3}}}, \tag{17}$$

where the quantity
$q = \mathbb{E}\left( \sup_{A \in \mathcal{A}} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}| \right)$
seems to be a central characteristic of the VC-class $\mathcal{A}$ given the distribution of $X$.
It may be interpreted as a measure of the complexity of the class $\mathcal{A}$ with respect to the
distribution of $X$: how often the class $\mathcal{A}$ is able to separate two
independent realizations of $X$.
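
To get a feel for the complexity measure $q = \mathbb{E}(\sup_{A} |\mathbf{1}_{X' \in A} - \mathbf{1}_{X \in A}|)$ and the bound $q \le 2p$ used in the proof sketch of Theorem 1, consider a toy setting that is not taken from the paper: $X$ uniform on $[0, 1]$ and the class of intervals $\{[0, s] : 0 \le s \le p\}$, whose union has mass $p$. Two points are separated by some set of the class exactly when the smaller one is at most $p$ (and they differ), so $q = 1 - (1 - p)^2 = 2p - p^2$. The sketch below compares this closed form with a Monte Carlo estimate.

```python
import random


def q_exact(p):
    """Closed form for the toy class {[0, s] : 0 <= s <= p}, X ~ U[0, 1]."""
    return 2.0 * p - p * p


def q_monte_carlo(p, draws, seed=0):
    """Estimate q by sampling independent pairs (X, X')."""
    rng = random.Random(seed)
    separated = 0
    for _ in range(draws):
        a, b = rng.random(), rng.random()
        # some [0, s] with s <= p contains exactly one of the two points
        if min(a, b) <= p and a != b:
            separated += 1
    return separated / draws


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        print(p, q_exact(p), q_monte_carlo(p, 100_000), 2.0 * p)
```

For small $p$ the exact value $2p - p^2$ is close to $2p$, so in this toy example the bound $q \le 2p$ is essentially tight in the low probability regime.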
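
Recall that the union class $\mathbb{A}$ and its associated probability $p$ are defined as
$\mathbb{A} = \cup_{A \in \mathcal{A}} A$ and $p = \mathbb{P}(X \in \mathbb{A})$. Noting that for
all $A \in \mathcal{A}$, $\mathbf{1}_{\{\cdot \in A\}} \le \mathbf{1}_{\{\cdot \in \mathbb{A}\}}$,
it is then straightforward from (15) that $q \le 2p$. As a consequence (17) holds true when
replacing $q$ by $2p$. Let us now make explicit the link between the expectation of $f$ and the
Rademacher average

$$\mathcal{R}_n = \mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i \mathbf{1}_{X_i \in A} \right|,$$

where $(\sigma_i)_{i \ge 1}$ is a Rademacher sequence independent of the $X_i$'s.

Lemma 13 With these notations the following inequality holds true:

$$\mathbb{E} f(X_{1:n}) \le 2 \mathcal{R}_n.$$

Proof The proof of this lemma relies on classical arguments: introducing a ghost sample
$(X_i')_{1 \le i \le n}$, namely an i.i.d. independent copy of the $X_i$'s, we may write:

$$
\begin{aligned}
\mathbb{E} f(X_{1:n})
&= \mathbb{E} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \\
&= \mathbb{E} \sup_{A \in \mathcal{A}} \left| \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i' \in A} \right] - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \\
&\le \mathbb{E} \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i' \in A} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| \\
&= \mathbb{E} \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i \left( \mathbf{1}_{X_i' \in A} - \mathbf{1}_{X_i \in A} \right) \right| \\
&\le \mathbb{E} \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{X_i' \in A} \right| + \mathbb{E} \sup_{A \in \mathcal{A}} \left| \frac{1}{n} \sum_{i=1}^n -\sigma_i \mathbf{1}_{X_i \in A} \right| \\
&= 2 \mathcal{R}_n.
\end{aligned}
$$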


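Combining (17) with Lemma 13 and the fact that $q \le 2p$ gives:

$$\mathbb{P}\left[ f(X_{1:n}) - 2 \mathcal{R}_n \ge t \right] \le e^{-\frac{n t^2}{4p + \frac{2t}{3}}}. \tag{18}$$

Recall that the relative Rademacher averages are defined in (13) as
$\mathcal{R}_{n,p} = \mathcal{R}_n / p$. It is well-known that $\mathcal{R}_n$ is of order
$O((V_{\mathcal{A}}/n)^{1/2})$, see Koltchinskii (2006) for instance. However, we hope for a
stronger bound than just $\mathcal{R}_{n,p} = O(p^{-1}(V_{\mathcal{A}}/n)^{1/2})$, since
$\mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i \mathbf{1}_{X_i \in A} \right|$
with $\mathbb{P}(X_i \in \mathbb{A}) = p$ is expected to be like
$\mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{n} \left| \sum_{i=1}^{np} \sigma_i \mathbf{1}_{Y_i \in A} \right|$
with $Y_i$ such that $\mathbb{P}(Y_i \in \mathbb{A}) = 1$. The result below confirms this
heuristic:

Lemma 14 The relative Rademacher average $\mathcal{R}_{n,p}$ is of order
$O\left( \sqrt{\frac{V_{\mathcal{A}}}{pn}} \right)$.

Proof Let us define i.i.d. r.v. $Y_i$, independent from the $X_i$'s, whose law is the law of $X$
conditioned on the event $X \in \mathbb{A}$. If $\overset{d}{=}$ means equal in distribution, it is
easy to show that
$\sum_{i=1}^n \sigma_i \mathbf{1}_{X_i \in A} \overset{d}{=} \sum_{i=1}^{\kappa} \sigma_i \mathbf{1}_{Y_i \in A}$,
where $\kappa \sim \mathrm{Bin}(n, p)$ is independent of the $Y_i$'s. Thus,

$$
\begin{aligned}
\mathcal{R}_{n,p}
&= \mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{np} \left| \sum_{i=1}^n \sigma_i \mathbf{1}_{X_i \in A} \right|
= \mathbb{E} \sup_{A \in \mathcal{A}} \frac{1}{np} \left| \sum_{i=1}^{\kappa} \sigma_i \mathbf{1}_{Y_i \in A} \right| \\
&= \mathbb{E}\left[ \mathbb{E}\left[ \sup_{A \in \mathcal{A}} \frac{1}{np} \left| \sum_{i=1}^{\kappa} \sigma_i \mathbf{1}_{Y_i \in A} \right| \ \Big|\ \kappa \right] \right]
= \mathbb{E}[\phi(\kappa)],
\end{aligned}
$$

where

$$\phi(K) = \mathbb{E}\left[ \sup_{A \in \mathcal{A}} \frac{1}{np} \left| \sum_{i=1}^{K} \sigma_i \mathbf{1}_{Y_i \in A} \right| \right] = \frac{K}{np} \mathcal{R}_K \le \frac{K}{np} \frac{C \sqrt{V_{\mathcal{A}}}}{\sqrt{K}} = \frac{C \sqrt{V_{\mathcal{A}}}\, \sqrt{K}}{np},$$

with $\mathcal{R}_K$ the Rademacher average computed on $K$ copies of $Y$. Thus, by Jensen's
inequality,

$$\mathcal{R}_{n,p} \le \frac{C \sqrt{V_{\mathcal{A}}}}{np}\, \mathbb{E}[\sqrt{\kappa}] \le \frac{C \sqrt{V_{\mathcal{A}}}}{np} \sqrt{\mathbb{E}[\kappa]} = \frac{C \sqrt{V_{\mathcal{A}}}}{\sqrt{np}}.$$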

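Finally we obtain from (18) and Lemma 14 the following bound:

$$\mathbb{P}\left[ \frac{1}{p} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| - 2 \mathcal{R}_{n,p} \ge t \right] \le e^{-\frac{n p t^2}{4 + \frac{2t}{3}}}. \tag{19}$$

Solving $\exp\left[ -\frac{n p t^2}{4 + \frac{2t}{3}} \right] = \delta$ with $t > 0$ leads to

$$t = \frac{1}{3np} \log\frac{1}{\delta} + \sqrt{ \left( \frac{1}{3np} \log\frac{1}{\delta} \right)^2 + \frac{4}{np} \log\frac{1}{\delta} } := h(\delta),$$

so that

$$\mathbb{P}\left[ \frac{1}{p} \sup_{A \in \mathcal{A}} \left| \mathbb{P}(X \in A) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{X_i \in A} \right| - 2 \mathcal{R}_{n,p} > h(\delta) \right] \le \delta.$$

Using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ if $a, b \ge 0$, we have
$h(\delta) \le \frac{2}{3np} \log\frac{1}{\delta} + 2 \sqrt{\frac{1}{np} \log\frac{1}{\delta}}$. In
the case of $\delta \ge e^{-np}$,
$\frac{1}{np} \log\frac{1}{\delta} \le \sqrt{\frac{1}{np} \log\frac{1}{\delta}}$, so that
$h(\delta) \le 3 \sqrt{\frac{1}{np} \log\frac{1}{\delta}}$. This ends the proof.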
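
The last step of the proof is a quadratic equation in $t$: setting $\exp[-npt^2/(4 + 2t/3)] = \delta$ gives $np\,t^2 - \frac{2}{3}\log(1/\delta)\,t - 4\log(1/\delta) = 0$, whose positive root is $h(\delta) = \frac{1}{3np}\log\frac{1}{\delta} + \sqrt{(\frac{1}{3np}\log\frac{1}{\delta})^2 + \frac{4}{np}\log\frac{1}{\delta}}$. The sketch below checks this root and the two simplifications numerically; the function names are illustrative.

```python
import math


def tail_bound(t, n, p):
    """Right-hand side of (19): exp(-n p t^2 / (4 + 2t/3))."""
    return math.exp(-n * p * t * t / (4.0 + 2.0 * t / 3.0))


def h(delta, n, p):
    """Positive root of tail_bound(t, n, p) = delta."""
    log_term = math.log(1.0 / delta)
    a = log_term / (3.0 * n * p)
    return a + math.sqrt(a * a + 4.0 * log_term / (n * p))


def h_upper(delta, n, p):
    """Simplified bound: (2 / (3np)) log(1/delta) + 2 sqrt(log(1/delta) / (np))."""
    log_term = math.log(1.0 / delta)
    return 2.0 * log_term / (3.0 * n * p) + 2.0 * math.sqrt(log_term / (n * p))


if __name__ == "__main__":
    n, p, delta = 1000, 0.05, 0.01
    t = h(delta, n, p)
    print(t, tail_bound(t, n, p), h_upper(delta, n, p))
```

Plugging $h(\delta)$ back into the tail bound recovers $\delta$, and for $\delta \ge e^{-np}$ the value stays below $3\sqrt{\log(1/\delta)/(np)}$, as claimed at the end of the proof.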

Appendix B. Note on Remark 5

To obtain the bound in (6), the following easy to show inequality is needed before applying
Theorem 1:

$$\sup_{g \in \mathcal{G}} |L_{\alpha,n}(g) - L_\alpha(g)| \le \frac{1}{\alpha} \left[ \sup_{g \in \mathcal{G}} \left| \mathbb{P}\left( Y \ne g(X),\ \|X\| > t_\alpha \right) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{Y_i \ne g(X_i),\ \|X_i\| > t_\alpha\}} \right| + \left| \mathbb{P}(\|X\| > t_\alpha) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{\|X_i\| > t_\alpha\}} \right| + \frac{1}{n} \right].$$

Note that the final objective would be to bound the quantity
$\sup_{g \in \mathcal{G}} |L_\alpha(g) - L_\alpha(g^*_\alpha)|$, where $g^*_\alpha$ is a Bayes
classifier for the problem at stake, i.e. a solution of the conditional risk minimization problem
$\inf_{g} L_\alpha(g)$. Such a bound involves a bias term
$\inf_{g \in \mathcal{G}} L_\alpha(g) - L_\alpha(g^*_\alpha)$, as in the classical setting.
Further, it can be shown that the standard Bayes classifier
$g^*(x) := 2\, \mathbf{1}\{\eta(x) > 1/2\} - 1$ (where $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$) is
also a solution of the conditional risk minimization problem. Finally, the conditional bias
$\inf_{g \in \mathcal{G}} L_\alpha(g) - L_\alpha(g^*_\alpha)$ can be expressed as
$\frac{1}{\alpha} \inf_{g \in \mathcal{G}} \mathbb{E}\left[ |2\eta(X) - 1|\, \mathbf{1}_{g(X) \ne g^*(X)}\, \mathbf{1}_{\|X\| \ge t_\alpha} \right]$,
to be compared with the standard bias
$\inf_{g \in \mathcal{G}} \mathbb{E}\left[ |2\eta(X) - 1|\, \mathbf{1}_{g(X) \ne g^*(X)} \right]$.